Evaluating AI Models & Systems

Language modelling metrics

Entropy

Cross-entropy (Entropy#^c91fe9)

Bits-per-Character & Bits-per-Byte

Perplexity
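
The four metrics above are closely related, so a small numerical sketch may help. It assumes we already have the probability the model assigned to each observed token; the function names and the toy numbers are illustrative, not taken from any particular library.

```python
import math

def entropy_bits(dist):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def cross_entropy_nats(token_probs):
    """Empirical cross-entropy: average negative log-probability (in nats)
    that the model assigned to the tokens that actually occurred."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def bits_per_token(token_probs):
    """Same quantity expressed in bits instead of nats."""
    return cross_entropy_nats(token_probs) / math.log(2)

def bits_per_character(token_probs, text):
    """Total bits spent on the text divided by its length in characters."""
    total_bits = bits_per_token(token_probs) * len(token_probs)
    return total_bits / len(text)

def bits_per_byte(token_probs, text):
    """Total bits spent on the text divided by its length in UTF-8 bytes."""
    total_bits = bits_per_token(token_probs) * len(token_probs)
    return total_bits / len(text.encode("utf-8"))

def perplexity(token_probs):
    """Perplexity is the exponential of the cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats(token_probs))

# Toy example: four tokens covering the 8-character string "abcdefgh".
probs = [0.5, 0.25, 0.25, 0.5]
text = "abcdefgh"
print(entropy_bits([0.5, 0.5]))
print(bits_per_token(probs))
print(bits_per_byte(probs, text))
print(perplexity(probs))
```

Because bits-per-character and bits-per-byte normalize by the text rather than by the token count, they stay comparable across models with different tokenizers, whereas per-token cross-entropy and perplexity do not.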

Task-specific evaluation metrics

Metrics

Evaluation benchmarks

Evaluating open-ended responses

Exact evaluation: produces a judgment without ambiguity
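
As one simple illustration of exact evaluation, the sketch below scores predictions by exact match against reference answers. The normalization rule (lowercasing and collapsing whitespace) is an assumption chosen for the example, and exact match is only one of several exact methods.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)

# Toy data: two of the three predictions match their references.
preds = ["Paris", " 42 ", "blue"]
refs = ["paris", "42", "red"]
print(exact_match_accuracy(preds, refs))
```

The same inputs always yield the same score, which is the property that separates this kind of metric from subjective evaluation.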

Subjective evaluation

Evaluating AI systems

Evaluation criteria

Model selection

Evaluation workflow at a high level:

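  1. filter out models whose hard attributes (constraints you cannot change, such as license terms) make them unusable
  2. use publicly available information, e.g., public benchmark performance, to shortlist the most promising models
  3. run experiments with your own task-specific evaluation to pick the best model
  4. continually monitor the model in production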
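
The first three steps of the workflow can be sketched as a small selection function. Everything here is hypothetical: the `Candidate` fields, the `license_ok` flag standing in for a hard attribute, and the `task_eval` callback standing in for your own evaluation pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Candidate:
    name: str
    license_ok: bool      # hypothetical hard attribute: can we legally use this model?
    public_score: float   # hypothetical public benchmark score, higher is better

def select_model(
    candidates: List[Candidate],
    task_eval: Callable[[Candidate], float],
    shortlist_size: int = 2,
) -> Optional[Candidate]:
    # Step 1: drop models ruled out by hard attributes.
    viable = [c for c in candidates if c.license_ok]
    # Step 2: use public information to shortlist the most promising models.
    shortlist = sorted(viable, key=lambda c: c.public_score, reverse=True)[:shortlist_size]
    if not shortlist:
        return None
    # Step 3: run the task-specific evaluation and keep the best scorer.
    return max(shortlist, key=task_eval)
    # Step 4 (monitoring in production) happens after deployment and is not modeled here.

# Toy usage with made-up models and scores.
models = [
    Candidate("a", True, 0.7),
    Candidate("b", False, 0.9),
    Candidate("c", True, 0.8),
    Candidate("d", True, 0.5),
]
task_scores = {"a": 0.9, "b": 0.95, "c": 0.6, "d": 1.0}
best = select_model(models, lambda c: task_scores[c.name])
print(best.name if best else None)
```

In this toy run the model with the best public score is excluded in step 1, and the model with the best task score never reaches step 3 because step 2 cut it from the shortlist, which shows how each stage trades evaluation cost against the risk of discarding a good candidate.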